376 research outputs found

    CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers.

    Background: The problem of supervised DNA sequence classification arises in several fields of computational molecular biology. Although this problem has been extensively studied, it is still computationally challenging due to the size of the datasets that modern sequencing technologies can produce. Results: We introduce CLARK, a novel approach to classify metagenomic reads at the species or genus level with high accuracy and high speed. Extensive experimental results on various metagenomic samples show that the classification accuracy of CLARK is better than or comparable to the best state-of-the-art tools and that it is significantly faster than any of its competitors. In its fastest single-threaded mode, CLARK classifies, with high accuracy, about 32 million metagenomic short reads per minute. CLARK can also classify BAC clones or transcripts to chromosome arms and centromeric regions. Conclusions: CLARK is a versatile, fast and accurate sequence classification method, especially useful for metagenomics and genomics applications. It is freely available at http://clark.cs.ucr.edu/
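
The idea of discriminative k-mers named in the title can be illustrated with a toy sketch. This is not CLARK's implementation: the function names, the tiny reference sequences and the simple majority vote are assumptions made for illustration. A k-mer counts as discriminative here if it occurs in the reference of exactly one target.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(references, k):
    """Map each k-mer that occurs in exactly one target to that target."""
    owners = defaultdict(set)
    for target, seq in references.items():
        for kmer in kmers(seq, k):
            owners[kmer].add(target)
    # k-mers shared by two or more targets carry no signal and are dropped.
    return {kmer: next(iter(t)) for kmer, t in owners.items() if len(t) == 1}


def classify(read, index, k):
    """Return the target with the most discriminative k-mer hits, or None."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    return hits.most_common(1)[0][0]


# Hypothetical toy references; real targets would be whole genomes.
references = {"species_A": "ACGTACGTGGA", "species_B": "TTGACCGTACC"}
index = build_discriminative_index(references, k=4)
print(classify("ACGTGG", index, 4))
```

A read made up only of k-mers shared between targets is left unclassified in this sketch, since none of its k-mers survive in the index.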

    Compression of Biological Sequences by Greedy Off-Line Textual Substitution

    Efficient and Accurate Detection of Topologically Associating Domains from Contact Maps

    Continuous improvements to high-throughput conformation capture (Hi-C) are revealing richer information about the spatial organization of the chromatin and its role in cellular functions. Several studies have confirmed the existence of structural features of the genome 3D organization that are stable across cell types and conserved across species, called topological associating domains (TADs). The detection of TADs has become a critical step in the analysis of Hi-C data, e.g., to identify enhancer-promoter associations. Here we present East, a novel TAD identification algorithm based on fast 2D convolution of Haar-like features, that is as accurate as the state-of-the-art method based on the directionality index, but 75-80x faster. East is available in the public domain at https://github.com/ucrbioinfo/EAST
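
As a rough illustration of why Haar-like features are cheap to evaluate on a contact map, the sketch below uses a summed-area table so that any rectangular block sum costs constant time. The specific feature (contacts within two adjacent blocks minus contacts between them) and all function names are assumptions for illustration, not the features or code used by East.

```python
def summed_area_table(matrix):
    """Build a table where sat[i][j] is the sum of matrix[:i][:j]."""
    n, m = len(matrix), len(matrix[0])
    sat = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            sat[i + 1][j + 1] = (matrix[i][j] + sat[i][j + 1]
                                 + sat[i + 1][j] - sat[i][j])
    return sat


def block_sum(sat, r0, c0, r1, c1):
    """Sum of rows r0..r1-1 and columns c0..c1-1 in constant time."""
    return sat[r1][c1] - sat[r0][c1] - sat[r1][c0] + sat[r0][c0]


def boundary_score(sat, start, mid, end):
    """Haar-like feature: contacts inside [start, mid) and [mid, end)
    minus contacts between the two intervals."""
    within = (block_sum(sat, start, start, mid, mid)
              + block_sum(sat, mid, mid, end, end))
    between = (block_sum(sat, start, mid, mid, end)
               + block_sum(sat, mid, start, end, mid))
    return within - between


# Hypothetical 4x4 contact map with two dense 2x2 domains on the diagonal.
contact_map = [
    [5, 5, 1, 1],
    [5, 5, 1, 1],
    [1, 1, 5, 5],
    [1, 1, 5, 5],
]
sat = summed_area_table(contact_map)
best_mid = max(range(1, 4), key=lambda mid: boundary_score(sat, 0, mid, 4))
print(best_mid)
```

In this toy map the score peaks at the position separating the two dense blocks, which is the kind of signal a domain-boundary detector looks for.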

    PuFFIN--a parameter-free method to build nucleosome maps from paired-end reads.

    Background: We introduce a novel method, called PuFFIN, that takes advantage of paired-end short reads to build genome-wide nucleosome maps with larger numbers of detected nucleosomes and higher accuracy than existing tools. In contrast to other approaches that require users to optimize several parameters according to their data (e.g., the maximum allowed nucleosome overlap or legal ranges for the fragment sizes), our algorithm can accurately determine a genome-wide set of non-overlapping nucleosomes without any user-defined parameter. This feature makes PuFFIN significantly easier to use and prevents users from choosing the "wrong" parameters and obtaining sub-optimal nucleosome maps. Results: PuFFIN builds genome-wide nucleosome maps using a multi-scale (or multi-resolution) approach. Our algorithm relies on a set of nucleosome "landscape" functions at different resolution levels: each function represents the likelihood of each genomic location to be occupied by a nucleosome for a particular value of the smoothing parameter. After a set of candidate nucleosomes is computed for each function, PuFFIN produces a consensus set that satisfies non-overlapping constraints and maximizes the number of nucleosomes. Conclusions: We report comprehensive experimental results that compare PuFFIN with recently published tools (NOrMAL, TEMPLATE FILTERING, and NucPosSimulator) on several synthetic datasets as well as real data for S. cerevisiae and P. falciparum. Experimental results show that our approach produces more accurate nucleosome maps with a higher number of non-overlapping nucleosomes than other tools
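
The consensus step described above (a non-overlapping set with as many nucleosomes as possible) resembles the textbook interval-scheduling problem. The sketch below applies the standard earliest-end greedy rule to candidates pooled from several resolution levels. The abstract does not say how PuFFIN computes its consensus, so this is an assumed stand-in, and the candidate coordinates are made up.

```python
def max_non_overlapping(candidates):
    """Select a maximum-size subset of pairwise non-overlapping intervals.

    candidates: iterable of half-open (start, end) intervals.
    Greedy rule: always keep the compatible interval that ends first.
    """
    chosen = []
    last_end = float("-inf")
    for start, end in sorted(candidates, key=lambda iv: iv[1]):
        if start >= last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


# Hypothetical candidate nucleosomes from two smoothing levels.
levels = [
    [(0, 147), (160, 307)],
    [(100, 247), (310, 457)],
]
pooled = [interval for level in levels for interval in level]
consensus = max_non_overlapping(pooled)
print(consensus)
```

The greedy rule ignores any per-candidate score; a method that weighs candidates by confidence would need a weighted variant instead.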

    RasBhari: optimizing spaced seeds for database searching, read mapping and alignment-free sequence comparison

    Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de
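
To make the notion of match and don't-care positions concrete, here is a minimal sketch of pattern-based word matching. It is not rasbhari or Spaced Words code: the function names and example sequences are assumptions. A pattern is written as a string in which '1' marks a match position and '0' a don't-care position.

```python
def spaced_word(seq, start, pattern):
    """Characters of seq at the match ('1') positions of pattern."""
    return "".join(seq[start + i] for i, p in enumerate(pattern) if p == "1")


def count_spaced_matches(seq_a, seq_b, pattern):
    """Count position pairs (i, j) whose spaced words are identical."""
    span = len(pattern)
    words_b = {}
    for j in range(len(seq_b) - span + 1):
        word = spaced_word(seq_b, j, pattern)
        words_b[word] = words_b.get(word, 0) + 1
    return sum(
        words_b.get(spaced_word(seq_a, i, pattern), 0)
        for i in range(len(seq_a) - span + 1)
    )


# Hypothetical sequences that differ by one substitution (G -> T).
seq_a = "ACGTAC"
seq_b = "ACTTAC"
print(count_spaced_matches(seq_a, seq_b, "1101"))
print(count_spaced_matches(seq_a, seq_b, "1111"))
```

With the substitution falling on a don't-care position, the spaced pattern still finds a match where the contiguous word of the same length finds none.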

    A compartmentalized approach to the assembly of physical maps

    Background: Physical maps have been historically one of the cornerstones of genome sequencing and map-based cloning strategies. They also support marker assisted breeding and EST mapping. The problem of building a high quality physical map is computationally challenging due to unavoidable noise in the input fingerprint data. Results: We propose a novel compartmentalized method for the assembly of high quality physical maps from fingerprinted clones. The knowledge of genetic markers enables us to group clones into clusters so that clones in the same cluster are more likely to overlap. For each cluster of clones, a local physical map is first constructed using FingerPrinted Contigs (FPC). Then, all the individual maps are carefully merged into the final physical map. Experimental results on the genomes of rice and barley demonstrate that the compartmentalized assembly produces significantly more accurate maps, and that it can detect and isolate clones that would induce "chimeric" contigs if used in the final assembly. Conclusion: The software is available for download at http://www.cs.ucr.edu/~sbozdag/assembler/

    Identification of candidate genes and molecular markers for heat-induced brown discoloration of seed coats in cowpea [Vigna unguiculata (L.) Walp].

    Background: Heat-induced browning (Hbs) of seed coats is caused by high temperatures, which discolor the seed coats of many legumes, affecting the visual appearance and quality of seeds. The genetic determinants underlying Hbs in cowpea are unknown. Results: We identified three QTL associated with the heat-induced browning of seed coats trait, Hbs-1, Hbs-2 and Hbs-3, using cowpea RIL populations IT93K-503-1 (Hbs positive) x CB46 (hbs negative) and IT84S-2246 (Hbs positive) x TVu14676 (hbs negative). Hbs-1 was identified in both populations, accounting for 28.3% to 77.3% of the phenotypic variation. SNP markers 1_0032 and 1_1128 co-segregated with the trait. Within the syntenic regions of Hbs-1 in soybean, Medicago and common bean, several ethylene forming enzymes, ethylene responsive element binding factors and an ACC oxidase 2 were observed. Hbs-1 was identified in a BAC clone in contig 217 of the cowpea physical map, where ethylene forming enzymes were present. Hbs-2 was identified in the IT93K-503-1 x CB46 population and accounted for 9.5% to 12.3% of the phenotypic variance. Hbs-3 was identified in the IT84S-2246 x TVu14676 population and accounted for 6.2% to 6.8% of the phenotypic variance. SNP marker 1_0640 co-segregated with the heat-induced browning phenotype. Hbs-3 was positioned on BAC clones in contig 512 of the cowpea physical map, where several ACC synthase 1 genes were present. Conclusion: The identification of loci determining heat-induced browning of seed coats and co-segregating molecular markers will enable transfer of hbs alleles into cowpea varieties, contributing to higher quality seeds

    Analysis of nucleosome positioning landscapes enables gene discovery in the human malaria parasite Plasmodium falciparum.

    Background: Plasmodium falciparum, the deadliest malaria-causing parasite, has an extremely AT-rich (80.7%) genome. Because of high AT-content, sequence-based annotation of genes and functional elements remains challenging. In order to better understand the regulatory network controlling gene expression in the parasite, a more complete genome annotation as well as analysis tools adapted for AT-rich genomes are needed. Recent studies on genome-wide nucleosome positioning in eukaryotes have shown that nucleosome landscapes exhibit regular characteristic patterns at the 5'- and 3'-end of protein and non-protein coding genes. In addition, nucleosome depleted regions can be found near transcription start sites. These unique nucleosome landscape patterns may be exploited for the identification of novel genes. In this paper, we propose a computational approach to discover novel putative genes based exclusively on nucleosome positioning data in the AT-rich genome of P. falciparum. Results: Using binary classifiers trained on nucleosome landscapes at the gene boundaries from two independent nucleosome positioning data sets, we were able to detect a total of 231 regions containing putative genes in the genome of Plasmodium falciparum, of which 67 highly confident genes were found in both data sets. Eighty-eight of these 231 newly predicted genes exhibited transcription signal in RNA-Seq data, indicative of active transcription. In addition, 20 out of 21 selected gene candidates were further validated by RT-PCR, and 28 out of the 231 genes showed significant matches using BLASTN against an expressed sequence tag (EST) database. Furthermore, 108 (47%) out of the 231 putative novel genes overlapped with previously identified but unannotated long non-coding RNAs. Collectively, these results provide experimental validation for 163 predicted genes (70.6%). Finally, 73 out of 231 genes were found to be potentially translated based on their signal in polysome-associated RNA-Seq representing transcripts that are actively being translated. Conclusion: Our results clearly indicate that nucleosome positioning data contains sufficient information for novel gene discovery. As distinct nucleosome landscapes around genes are found in many other eukaryotic organisms, this methodology could be used to characterize the transcriptome of any organism, especially when coupled with other DNA-based gene finding and experimental methods (e.g., RNA-Seq)

    DNA-encoded nucleosome occupancy is associated with transcription levels in the human malaria parasite Plasmodium falciparum.

    Background: In eukaryotic organisms, packaging of DNA into nucleosomes controls gene expression by regulating access of the promoter to transcription factors. The human malaria parasite Plasmodium falciparum encodes relatively few transcription factors, while extensive nucleosome remodeling occurs during its replicative cycle in red blood cells. These observations point towards an important role of the nucleosome landscape in regulating gene expression. However, the relation between nucleosome positioning and transcriptional activity has thus far not been explored in detail in the parasite. Results: Here, we analyzed nucleosome positioning in the asexual and sexual stages of the parasite's erythrocytic cycle using chromatin immunoprecipitation of MNase-digested chromatin, followed by next-generation sequencing. We observed a relatively open chromatin structure at the trophozoite and gametocyte stages, consistent with high levels of transcriptional activity in these stages. Nucleosome occupancy of genes and promoter regions was subsequently compared to steady-state mRNA expression levels. Transcript abundance showed a strong inverse correlation with nucleosome occupancy levels in promoter regions. In addition, AT-repeat sequences were strongly unfavorable for nucleosome binding in P. falciparum, and were overrepresented in promoters of highly expressed genes. Conclusions: The connection between chromatin structure and gene expression in P. falciparum shares similarities with other eukaryotes. However, the remarkable nucleosome dynamics during the erythrocytic stages and the absence of a large variety of transcription factors may indicate that nucleosome binding and remodeling are critical regulators of transcript levels. Moreover, the strong dependency between chromatin structure and DNA sequence suggests that the P. falciparum genome may have been shaped by nucleosome binding preferences. Nucleosome remodeling mechanisms in this deadly parasite could thus provide potent novel anti-malarial targets

    Composition Profiler: a tool for discovery and visualization of amino acid composition differences

    Background: Composition Profiler is a web-based tool for semi-automatic discovery of enrichment or depletion of amino acids, either individually or grouped by their physico-chemical or structural properties. Results: The program takes two samples of amino acids as input: a query sample and a reference sample. The latter provides a suitable background amino acid distribution, and should be chosen according to the nature of the query sample, for example, a standard protein database (e.g. SwissProt, PDB), a representative sample of proteins from the organism under study, or a group of proteins with a contrasting functional annotation. The results of the analysis of amino acid composition differences are summarized in textual and graphical form. Conclusion: As an exploratory data mining tool, our software can be used to guide feature selection for protein function or structure predictors. For classes of proteins with significant differences in frequencies of amino acids having particular physico-chemical (e.g. hydrophobicity or charge) or structural (e.g. α helix propensity) properties, Composition Profiler can be used as a rough, light-weight visual classifier.
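
The comparison of a query sample against a reference sample can be sketched as a per-residue relative difference in frequency. The formula used here, (query frequency - reference frequency) / reference frequency, and the function names are assumptions for illustration; the abstract does not specify the tool's statistic, and no significance testing is attempted.

```python
from collections import Counter


def composition(sequences):
    """Fraction of each amino acid across all sequences in a sample."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def relative_differences(query, reference):
    """Positive values suggest enrichment in the query, negative depletion.

    Only residues present in the reference are reported, which avoids
    dividing by a zero background frequency.
    """
    q = composition(query)
    r = composition(reference)
    return {aa: (q.get(aa, 0.0) - r[aa]) / r[aa] for aa in r}


# Hypothetical one-sequence samples over a two-letter alphabet.
diffs = relative_differences(query=["AAAK"], reference=["AAKK"])
print(diffs)
```

Grouping residues by a property such as charge would amount to summing the frequencies of the group members before taking the difference.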